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The evaluation of a match between the DNA profile of a stain 
found on a crime scene and that of a suspect (previously identified) 
involves the use of the unknown parameter p = (pi,p2, (the or¬ 
dered vector which represents the frequencies of the different DNA 
profiles in the population of potential donors) and the names of the 
different DNA types. We propose a Bayesian nonparametric method 
which models p through a random variable P distributed according 
to the two-parameter Poisson Dirichlet distribution, and discards the 
information about the names of the different DNA types. The ulti¬ 
mate goal of this model is to evaluate the so-called ‘probative value’ 
of DNA matches in the rare type case, that is the situation in which 
the suspect’s profile, matching the crime stain profile, is not in the 
database of reference. 


1. Introduction. The largely accepted method for evaluating how much some avail¬ 
able data V (typically forensic evidence) is helpful in discriminating between two hy¬ 
potheses of interest (the prosecution hypothesis Hp and the defense hypothesis Rd), is the 
calculation of the likelihood ratio (LR), a statistic that expresses the relative plausibility 
of the data under these hypotheses, defined as 


_ Pr(P|Rp 
Vx{V\Hd) 


Widely considered the most appropriate framework to report a measure of the ‘proba¬ 
tive value’ of the evidence regarding the two hypotheses (Robertson and Vignaux, 1995; 
Evett and Weir, 1998; Aitken and Taroni, 2004; Balding, 2005), it indicates the extent to 
which data is in favor of one hypothesis over the other. Forensic literature presents many 
approaches to calculate the LR, mostly divided into Bayesian and frequentist methods (see 
Cereda (20156) for a careful differentiation between these two approaches). 

This paper proposes a Bayesian nonparametric method for the LR assessment in the 
rare type match case, the challenging situation in which there is a match between some 
characteristic of the recovered material and of the control material, but this characteris¬ 
tic has not been observed yet in previously collected samples (i.e. database of reference). 
This constitutes a problem because the LR value depends on the proportion of the match¬ 
ing characteristic in a reference population, and this proportion is, in standard practice, 
estimated using the relative frequency of the characteristic in the available database. In 
particular, we will focus on Y-STR DNA profile matches, for which the rare type match 
problem is often recurring (Cereda, 20156). In this case data to evaluate is made of the 
information about the matching profile and of the list of DNA prohle in the database. 

The use of a Bayesian nonparametric method is justified by the fact that the parameter 
of the model is the infinite dimensional vector p, made of the (unknown) sorted population 
proportions of all possible Y-STR profiles, assumed to be infinitely many. As prior over 
p we choose the two parameter Poisson Dirichlet distribution, and treat its parameters 
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as hyperparameters, hence provided with a hyperprior. Moreover, we will discard the 
information contained in the names of the profiles, and this will lead to a reduction of 
the data P to a smaller amount of information D. The reduction of the data can be a 
wise practice in presence of many nuisance parameters as explained in Cereda (20156), 
and sometimes the likelihood ratio based on the data reduction is much more precisely 
estimated than the likelihood ratio based on all data. 

The paper is structured in the following way: Section 2 introduces the notation, the 
assumptions of our model and the prior distribution chosen for parameter p. Section 3 
displays the model, via Bayesian network representation, along with some theory on ran¬ 
dom partitions useful to define a clever and compact representation of the reduced data 
D. An alternative representation of the same model via the Chinese restaurant process is 
also described. Section 4 introduces relevant known results regarding the two parameter 
Poisson Dirichlet distribution, along with a new lemma, which can be used for all the 
situations in which prosecution and defense agree on the distribution of part of the data 
and disagree on the distribution of the rest, given the parameter(s). This result will allow 
to derive the LR in a very elegant way (Section 5). Section 6 displays some experiments 
of application of this model on a real database of Y-STR profiles, such as model fitting, 
asymptotic power law behavior, study of the loglikelihood function, and comparison with 
the LR values obtained in the ideal situation in which vector p is known. Lastly, Section 7 
proposes questions for future research. 

2. A Bayesian nonparametric model for the rare type match. 

2.1. The rare type match problem. In order to evaluate the match between the profile 
of a particular piece of evidence and a suspect’s profile, it is necessary to estimate the 
proportion of that profile in the population of potential perpetrators. Indeed, it is intuitive 
that the rarer the matching profile, the more the suspect is in trouble. Problems arise when 
the observed frequency of the profile in a sample from the population of interest (i.e., in 
a reference database) is 0. Such characteristic is likely to be rare, but it is challenging to 
quantify how rare it is. The rare type problem is particularly important in case a new 
kind of forensic evidence, such as results from DIP-STR markers (see for instance Cereda 
et ah (2014)) is involved, and for which the available database size is still limited. The 
same happens when Y-chromosome (or mitochondrial) DNA profiles are used since the 
set of possible Y-STR profiles is extremely large. As a consequence, most of the Y-STR 
haplotypes are not represented in the database. The Y-STR marker system will thus be 
retained here as an extreme but in practice common and important way in which the 
problem of assessing evidential value of rare type match can arise. This problem is so 
substantial that it has been defined “the fundamental problem of forensic mathematics” 
(Brenner, 2010). 

The empirical frequency estimator, also called naive estimator, that uses the frequency of 
the characteristic in the database, puts unit probability mass on the set of already observed 
characteristics, and it is thus unprepared for the observation of a new type. A solution 
could be the add-constant estimators (in particular the well known add-one estimator, due 
to Laplace (1814), and the add-half estimator of Krichevsky and Trohmov (1981)), which 
add a constant to the count of each type, included the unseen ones. However, this method 
requires to know the number of possible unseen types, and it is also not very performing 
when this number is large compared to the sample size (see Gale and Church (1994) for 
additional discussion). Alternatively, Good (1953), based on an intuition on A.M. Turing, 
proposed the Good Turing estimator for the total unobserved probability mass, based on 
the proportion of singleton observations in the sample. An extension of this estimator is 
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applied to the LR assessment in the rare type match in Cereda (20156). For a comparison 
between add one and Good-Turing estimator, see Orlitsky et al. (2003). As pointed out 
in Anevski et al. (2013), the naive estimator, and the Good Turing estimator are in some 
sense complementary: the first gives a good estimate for the observed types, and the second 
for the probability mass of the unobserved ones. More recently, Orlitsky et al. (2004) have 
introduced the high profile estimator, which extends the tail of the naive estimator to the 
region of unobserved types. Anevski et al. (2013) improved this estimator and provided 
the consistency proof. Papers that address the rare Y-STR haplotype problem in forensic 
context are for instance Egeland and Salas (2008), Brenner (2010), and Cereda (2015a), 
which applied classical Bayesian approach (the beta binomial and the Dirichlet multinomial 
problem) to the LR assessment in the rare haplotype case. Moreover, the Discrete Laplace 
method presented in Andersen et al. (2013), even though not specifically designed for the 
rare type case can be successfully applied to that extent Cereda (20156). 

Bayesian nonparametric estimators for the probability of observing a new type have been 
proposed by Tiwari and Tripathi (1989), using Dirichlet process, by Lijoi et al. (2007) 
using general Gibbs prior, and by Favaro et al. (2009) with specific interest to the two 
parameter Poisson Dirichlet prior. However, for the LR assessment it is required not only 
the probability of observing a new species but also the probability of observing this same 
species twice (according to the defense the crime stain profile and the suspect profile are 
two independent observations), and to our knowledge, this paper is the first one to address 
the problem of LR assessment in the rare haplotype case using Bayesian nonparametric 
models. As prior we will use the Poisson Dirichlet distribution, which is proving useful in 
many discrete domain, in particular language modelling (Teh et ah, 2006). In addition, it 
shows a power law behaviour which describe a incredible variety of phenomena (Newman, 
2005). 

2.2. Notation. Throughout the paper the following notation is chosen: random vari¬ 
ables and their values are denoted, respectively, with uppercase and lowercase characters: 
X is a realization of X. Random vectors and their values are denoted, respectively, by 
uppercase and lowercase bold characters: p is a realization of the random vector P. Prob¬ 
ability is denoted with Pr(-), while density of a continuous random variable X is denoted 
alternatively by px{x) or by p{x) when the subset is clear from the context. For a discrete 
random variable Y, the density notation pyiu) and the discrete one Pr(y = y) will be 
alternately used. Moreover, we will use shorthand notation like p{y \ x) to stand for the 
probability density of Y with respect to the conditional distribution of Y given X = x. 

Notice that when using the notation in (1), D is regarded as events. However, later in 
the paper, it will be regarded as a random variables. In that case, the following notation 
will thus be preferred: 

,,, Pr(P = dlH,) p(d\H,) 

Pr(C = <!|/;,) p(d\HX 

Lastly, notice that “DNA types” is used throughout the paper as a general formula to 
indicate Y-STR profiles. 

2.3. Model assumptions. Our model is based on the two following assumptions: 
Assumption 1 There are infinitely many DNA types in Nature. 

The reason for this assumption is that there are so many possible DNA types that they can 
be considered infinite. This assumption, already used by e.g. Kimura (1964) in the ‘infinite 
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alleles model’, allows to use Bayesian nonparametric methods and avoids the problem of 
specifying how many different types there are in Nature. 

Assumption 2 The names of the different DNA types do not contain information. 

The specific sequence of numbers that forms a DNA profile carries information: if two 
profiles show few differences this means that they are separated by few mutation drifts, 
hence the profiles share a relatively recent common ancestor. However, this information is 
difficult to exploit and may be not so relevant for the LR assessment. This is the reason 
why we will treat DNA types as “colors”, and only consider the repartition into different 
categories. Stated otherwise, we put no topological structure on the space of the DNA 
types. 

Notice that this assumption makes the model a priori suitable for any characteristic 
which shows many different possible types, thus what written still holds, in principle, also 
replacing ‘DNA types’ with any other type. However, we will only test the model with 
Y-STR data. 

2.4. Prior. In Bayesian statistics, parameter(s) of interest are modeled through ran¬ 
dom variables. The (prior) distribution over the parameter(s) should represent the uncer¬ 
tainty about its (their) value(s). 

LR assessment for the rare type match involves two unknown parameters of interest: 
one is /i G {Hp, Lid}; representing the unknown true hypothesis, the other is p, the vector 
of the unknown population frequencies of all DNA profiles in the population of potential 
perpetrators. The dichotomous random variable H is used to model parameter h, and 
the posterior distribution of this random variable, given data, is the ultimate aim of a 
forensic inquiry. In a similar way, random variable P can be used to model p. Because 
of Assumption 1, p is an infinite dimensional parameter, hence the need of Bayesian non¬ 
parametric methods (Hjort et ah, 2010). In particular p = {pt\t G T), with T a countable 
set of indexes, pt > 0, and YltPt ~ Moreover, because of Assumption 2, data will be 
reduced to random partitions, as explained in Section 3.1, and it will turn out that the 
distribution of these partitions does not depend on the order of the pi. Hence, we can force 
the parameter p to have values in Voo = {{pi,P 2 , --OIpi ^ P 2 > ■■■, > 0}, the 

ordered infinite dimensional simplex. The ordered random vector p describes an infinite 
population randomly partitioned into DNA types. The randomness is described by the 
prior distribution over p, for which we choose the two-parameter Poisson Dirichlet dis¬ 
tribution (Pitman and Yor, 1997; Feng, 2010; Buntine and Hutter, 2010; Carlton, 1999; 
Pitman and Picard, 2006), defined in the following way: 

Definition 1 (two parameter GEM distribution). Given a and 9 satisfying the fol¬ 
lowing conditions: 

(3) 0 < a < 1, and 6 > —a. 

the vector W = (Wi, W 2 , ■■■) is said to be distributed according to the GEM(a:, 9), if 

2—1 

Vi W^ = ViYl{l-Vj), 
i=i 

where Vi, V 2 ,... are independent random variables distributed according to 

Vi ~ B{1 — a,9 + ia). 

It holds that Wi > 0, and J2i kPi = 1- 
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The GEM distribution (short for Griffin - Engen - McCloskey distribution’) is well known 
in literature as the “stick breaking prior”, since it measures the random sizes in which a 
stick is broken iteratively. This distribution is invariant under size biased permutation 
(Engen, 1975), the random permutation defined by sampling from the population and 
assigning to each type a label, based on the order in which it is first sampled. 

Definition 2 (Two parameter Poisson Dirichlet distribution). Given a and 9 sat¬ 
isfying condition (3), and a vector W = iWi,W 2 , ...) ~ GEM{a,9), the random vector 
P = {Pi,P 2 ,...) obtained by ordering W, such that pi > Pi+i, is said to be Poisson 
Dirichlet distributed PD{a,9). Parameter a is called discount parameter, while 9 is the 
concentration parameter. 

Notice that the vector P is obtained by sorting the vector W in non increasing order, 
while the vector W can be obtained by the so-called size biased permutation of the indexes 
of P (Perman et ah, 1992; Pitman and Yor, 1997). 

The two parameter Poisson Dirichlet distribution PD(a,0) is the generalization of the 
well known Poisson Dirichlet distribution with a single parameter 9 introduced by Kingman 
(1975), which is the representation measure (Kingman, 1977, 1978) of the celebrated Ewens 
sampling formula (Ewens, 1972), widely applied in genetics (Karlin and McGregor, 1972; 
Kingman, 1980). Eor our model we will not allow a = 0, hence we will assume 0 < a < 1. 

It is worth mentioning that an alternative choice for the parameters space is a < 0, 
9 = —ma for some m G N (Pitman, 1996; Gnedin and Pitman, 2006; Gnedin, 2010; 
Cerquetti, 2010). It corresponds to a model with finitely many (m) DNA types, where the 
prior over P = (Pi,..., Pm) is Dirichlet with m parameters equal to —a. 

Lastly, we point out that, in practice, we cannot assume to know parameters a and 9: 
this is why we will treat them as hyperparameters on which we will put an hyperprior. 



Fig 1. Bayesian network to show the eonditional dependencies of the relevant random variables in our 
model. 


3. The model. The Bayesian network of Eigure 1 encapsulates the conditional depen¬ 
dencies of the variables of the proposed model. They are defined through random variables 
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defined as follows: 

• H is a dichotomous random variable that represents the hypotheses of interest and 
can take values h G {Hp,Hd}, according to the prosecution or the defense, respec¬ 
tively. A uniform prior on the hypotheses is chosen: 

Pj:(H = h) (X I for h = {Hp,Hj}. 

• (A, 0) is the random vector that represents the hyperparameters a and 0, satisfying 
condition (3). The joint distribution of these two parameters (hyperprior) will be 
generically denoted as p{a,0): 


(A,0) ^p{a,0). 

• The random vector P with values in Vqo, represents the ranked population frequen¬ 
cies. P = p = {pi,P 2 , ■••) means that pi is the frequency of the most common DNA 
type in the population, p 2 is the frequency of the second most common DNA type, 
and so on. As a prior for P we use the two-parameter Poisson Dirichlet distribution 
as in Definition 2: 

P|(A,0) = {a,0) ~ PD{a,0). 

• Integer valued random variables Xi, ..., Xn represent the ranks of the population 
proportions of the DNA types of the individuals in the database (after some arbitrary 
ordering for profiles in the database is chosen). For instance, X 3 = 5 means that 
the third individual in the database has the fifth most common DNA type in the 
population. Since p is unknown these random variables cannot be observed. Given 
p they are an i.i.d. sample from p: 

(4) Xi,X2,...,Xn\P = P^i.i.d.P- 

• A„+i represents the rank of the suspect’s DNA type. It is again a draw from p. 

Xn+l\P =pr^p. 

• A „+2 represents the rank of the crime stain’s DNA type. According to the prose¬ 
cution, given Xn+i, this random variable is deterministic (it is equal to Xn+i with 
probability 1). According to the defense it is another sample from p: 

I f if h = H„ 

Xn+ 2 \P=P,Xn+l=Xn+l,H = hr^ { 

[p if h = Hd 

As already mentioned, Xi,Xn +2 can not be observed. They represent the database 
ranked according to the unknown rank in p and constitute an intermediate layer that helps 
in expressing the data in terms of observable partitions. Section 3.1 recalls some notions 
about random partitions, useful before defining node D, representing the ‘reduced’ data 
we want to evaluate. 

3.1. Random partitions. A partition of a set A is an unordered collection of nonempty 
and disjoint subsets of A the union of which forms A. Particularly interesting for our model 
are partitions of the set A = [n] = {1, ...,n}, denoted as vrj^]. The set of all partitions of 
[n] will be denoted as V^n]- Random partitions of [n] will be denoted as IIj^j. In addition, 
a partition of n is a finite non increasing sequence of positive integers that sum up to n. 
Partitions of n will be denoted as TTn- 
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Given a sequence of integer valued random variables Xi, ...,Xn, let n[„](Xi, X 2 , Xn) 
be the random partition defined by the equivalence classes of their indexes using the 
random equivalence relation i ~ j iff Xj = Xj. This construction allows to build a map 
from the set of values of Xi, to the set of the partitions of [n] as in the following 

example (n = 10): 


^ iP[io] 

-^ 1 , ■■■,Xio I—)• n[io](Xi,X2, ...,Xio) 

(3,1,3,1, 2, 2, 6, 9,4,1) ^ {{1, 3}, {2,4,10}, {5,6}, {7}, {8}, {9}} 

Typical data to evaluate in case of a match is 27 = where E = {Es,Et), and 

• B = the database of size n, which contains a sample of DNA types, indexed by 
i = 1,..., n, from the population of possible perpetrators, 

• Eg = suspect’s DNA type, 

• Et = crime stain’s DNA type (matching with the suspect’s type). 

In agreement with Assumption 2, we can consider the reduction of data which ignores 
information about the names of the DNA types: this is achieved, for instance, by retaining 
only the equivalence classes of the relation “to have the same DNA type”. 

The database can thus be reduced to the partition of [n], denoted and obtained 
using the equivalence classes of the indexes. Notice that the same partition is obtained via 
random variables Xi, ...,X„, as defined in (4). Stated otherwise, we can reduce B to 
the partition of [n] obtained from the database. However, data is actually made of the 
background data along with the evidence, two new observations that match. In a similar 
way, when the suspect profile is considered we obtain the partition where the first 

n integers are partitioned as in and n + 1 constitutes a single subset (at least in the 
rare type match case). 

When the crime stain profile is considered we obtain the partition where the 

first n integers are partitioned as in TTpp and n + 1 and n + 2 belongs to the same (new) 
subset. 

Random variables Hpp and are used to model vr^p and 

respectively. 

Notice that, given a and 6 , prosecution and defense agree on the distribution of n|^^] 
but disagree on the distribution of . 

It is worth noticing that, by construction, the same random partitions can be defined 
through random variables Xi, ..., Xn+ 2 ' 

^[n] (^1) •••) ^n)) 

n[n+l](-^l, ■■■,Xn+l), 
n[n+2](-^l) •••,^n+2)- 


npb = 


nDb+ _ 


n 


Db++ 

n+2] 


To clarify, consider the following example of a database (Db) with k = 6 different DNA 
types, from n = 10 individuals: 


Db — (hi, /i 2 , hi, /i 2 , /i 3 , /i 3 , /i 4 , / 15 , /le, / 12 ), 
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where hi is the name of the ith DNA type in the order chosen for the database. This can 
be reduced to the partition of [ 10 ]: 

= {{1> 3}, {2,4,10}, {5, 6 }, {7}, { 8 }, {9}}. 

Then, the part of data prosecution and defense agree on is 

>r“+ = {{l,3),{2,4.10},{5.6},{7}.{8},{9),{n}), 

while the entire data D can be represented as 

7rg++ = {{1, 3}, {2, 4,10}, {5, 6 }, {7}, { 8 }, {9}, {11,12}}. 

Now, assume that p is known, thus we know also that hi is, for instance, the fourth 
most frequent type, /i 2 is the second most frequent type, and so on. Stated otherwise, we 
are now able to observe the variables Xi, ..., X„_|_ 2 : Xi = 4, X 2 = 2, A 3 = 4, A 4 = 2, 
As = 3, Ae = 3, A 7 = 10, As = 13, Ag = 5, Aio = 2, An = 9, A 12 = 9. It is easy to check 
that n[„](Ai,..., A„) = Trgf, n[,+q(Ai,..., A„+i) = 7 r° 5 _+, n[„+ 2 ](^i, •.•,^n+ 2 ) = ■ 

Data D can now be defined as: 

• D = the partition of [n + 2] obtained from the database enlarged with the 

two new observations. 

Node D of Figure 1 is defined accordingly. Notice that, given Ai,..., A„+ 2 ) D is deter¬ 
ministic. A very relevant result is that, according to Proposition 4 in Pitman (1992) it is 
possible to describe directly the distribution of D \ a, 6, H. Hence, we can get rid of the 
intermediate layer of nodes Ai, ..., A„_|_ 2 . In particular, it holds that if 

P I a,e ^ PD{a,9), 


and 

Ai, A2 ,... \ P = p ~i,i,d p, 

then, for all n, the random partition Hj^] = n[„](Ai,..., A„) has the following distribution: 

k 


(5) 


3 ) 0,0 


(7r[n]) := Pr(n[„] = 7r[„]|Q;,6») = 


[6 + a\k-i-^c 

[9 + l]n-l;l 

•' ’ 2 = 1 


]^[1 


where n, is the size of the ith block of ttj^] (the blocks are here ordered according to the 


least element), and Vx, 6 G M, a G N, [x\a,b '■= 


Wi=l{x + ib) ifaGN\{0} 


. This formula 

^0 if a = 0 

is also known as the Pitman sampling formula, further studied in Pitman (1995). The 
model of Figure 1 can thus be simplified (see Figure 2). 

It holds that Pv{D\a,9,Hp) = 1 (vr^^+j), and Pv{D\a,9,Hd) = 

Notice that for a = 0 we obtain the Ewens’s sampling formula. 


3.2. Chinese Restaurant representation. There is an alternative characterization of this 
model, called “Chinese restaurant process”, due to Aldous (1985) for the one parameter 
case, and studied in details for the two parameter version in Pitman and Picard (2006). 
It is defined as follows: consider a restaurant with infinite tables, each one infinitely large. 
Let Yi,Y 2 ,... be integer valued random variables that represent the seating plan: tables 
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Fig 2. Simplified version of the Bayesian network in Figure 1 


are ranked in order of occupancy, and Yi = j means that the ith customer seats at the jth 
table to be created. The process is described by the following transition matrix: 

Fi = 1 


( 6 ) 


Pr(y„+i = 


' 0 + ka 
n + 9 

< 

Hi — a 
. n + 0 


if i = A: + 1 

if 1 < i < k 


where k is the number of tables occupied by the hrst n customers, and n* is the number 
of customers that occupy table i. 

Yi, ...,Yn are not i.i.d., nor exchangeable, but it holds that n[„](yi, ...,Yn) has the same 
distribution as ...,Xn)-, with Xi^...^Xn defined as in (4) (in particular they are 

distributed according to the Pitman sampling formula (5)). 

Stated otherwise, we can use the seating plan of n customers to obtain the same partition 
built through the database (or by partitioning Xi,Xn)- Then is obtained 

when a new customer has chosen an unoccupied table (remember we are in the rare type 
case), and is obtained when the n + 2nd customer goes to the same table of the 

n + 1st (suspect and crime stain have the same DNA type). In particular, thanks to (??), 
we can write 


(7) 

anc 

( 8 ) 


p(7r 


Db++ 

[n+2] 


Hr, 


P) 


p{-K 


Db++ 

[n+2] 


Hd,'^ 


Db+ 

[n+1] 


,a,e) 


1 — a 
n + 1 + 6 


since the n + 2nd customer goes to the same table of the n + 1st where he seats alone. 


4. Some results. This section presents some useful results that will be used in the 
forthcoming sections. In particular. Lemma 4.1, suitable to broader applications, is here 
applied to simplify the LR development. Then, some results from Pitman and Picard 
(2006) regarding the two parameter Poisson Dirichlet distribution, are listed. 

4.1. A useful Lemma. The following lemma is a result regarding four general random 
variables A, X, Y, H whose conditional dependencies are described by the Bayesian net¬ 
work of Figure 4.1. The importance of this result is due to the possibility of applying 
it to a very common forensic situation: the prosecution and the defense disagree on the 
entirety of data (y) but agree on a part of it (X) (indeed, as already noticed, defense 
and prosecution agree on the distribution of > but not on the distribution of ) ■ 

Data depends on parameters (A). 
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Fig 3. Conditional dependencies of the random variables of Lemmaf.,! 


Lemma 4.1. Given four random variables A, H , X and Y , whose conditional depen¬ 
dencies are represented by the Bayesian network of Figure 4-1, the likelihood function for 
h, given X = x and Y = y satisfies 

lik(/i \ x,y) oc ^{p{y \ x, A, h) \ X = x). 

Proof. The model of Figure 4.1 represents four variables A, H, X and Y whose joint 
probabilty density can be factored as 

p{a,h,x,y) = p{a)p{x\ a)p{h)p{y\ x,a,h). 

By Bayes formula, p{a)p{x \ a) = p{x)p{a \ x). This rewriting corresponds to reversing 
the direction of the arrow between A and X: 



The random variable X is now a root node. This means that when we probabilistically 
condition on X = x, the graphical model changes in a simple way: we can delete the node 
X, but just insert the value x as a parameter in the conditional probability tables of the 
variables A and Y which formerly had an arrow from node X. The next graph represents 
this model: 



This tells us, that conditional on X = x, the joint density of X, Y and H is equal to 

p{a I x)p{h)p{y I x,a,h). 
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The joint density of H and Y is obtained by integrating out the variable a. It can be 
expressed as a conditional expectation value, since p{a \ x) is the density of A given 
X = X. We find: 

p{h)'E{p{y \ x,A,h) \ X = x). 

Recall that this is the joint density of two of our variables, H and Y, after conditioning 
on the value X = x. Let us now also condition on T = y. It follows that the density of H 
given X = X and Y = y \s proportional (as function of H, for fixed x and y) to the same 
expression, p{h)E{p{y \ x, A, h) \ X = x). 

This is a product of the prior for h with some function of x and y. Since posterior odds 
equals prior odds times likelihood ratio, it follows that the likelihood function for h, given 
X = x and Y = y satisfies 

lik(h \ x,y) oc E{p{y \ x, A, h) \ X = x). 


□ 

Corollary 4.1. Given four random variables A, H, X and Y, whose conditional 
dependencies are represented by the network of Figure 4-1, the likelihood ratio for H = hi 
against H = h 2 given X = x and Y = y satisfies 

/qN rp_ Hp{y\x,A,hi)\X = x) 

^ ^ E(p(y|x,A,/.2)|W = x)- 

4.2. Known results about the two parameter Poisson Dirichlet distribution. We will 
now list some theoretical results which will be useful in the forthcoming analysis. Most of 
these results can be found in Pitman and Picard (2006). 

Denote as Kn the random number of blocks of a partition !![„] distributed according to 
the Pitman sampling formula with parameters a and 6 . 

• It exists a positive random variable Sa such that 

(10) lim —- = Sa a.s. 

n—>+oo n°‘ 

the distribution of Sa is a generalization of the Mittag LefHer distribution (Gorenflo 
et ah, 2014). 

• If P~PD(a,6 l), then 

(11) Pi —^ a.s., when i —^ Too 

for a random variable Z such that = T{1 — a)/Sa- 

• For a fixed a G (0,1), the PD(q!, 0) (for different 6 ) are all mutually absolutely 
continuous. This means that 9 cannot be consistently estimated for a in the range 
of interest. On the other hand, the power law behavior described above tells us that 
a can be consistently estimated. 

• Studying (6) one can see that when n increases, the parameter 9 becomes less and less 
important. However, it describes how much “social” are the customers: the smaller 9 
the more the customers tend to seat to already occupied tables. Thus, it determines 
the sizes of the big tables, but it won’t be much important for our application (the 
more rare DNA types). 
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Given n„ distributed according to Pitman sampling formula (5), it holds that 


( 12 ) 


lim 

n^+oo 


mj{n) aT{j — a) 




Set a.s. Vj 


r(l - a)j\ 

where mj{n), j = 1, ...,n the random number of blocks of the partition IIj^] of size 
j. This result is presented in Gnedin et al. (2007), based on Karlin (1967). 


5. The likelihood ratio. The hypotheses of interest are: 

• Hp = The crime stain was left by the suspect. 

• Hd = The crime stain was left by someone else. 


The LR will thus be defined as 


LR = 


Pi^[n+2t\^p) 

P(^^n+2t\^d) 


p{-K 


Db+ 

N+I] 


p{-K 


Db+ 

[n+1] 




where the last equality is due to the fact that is deterministic, given 

Gorollary 4.1 with A = {A, Q), X = Y = and H = H allows to obtain 

the LR as: 


LR 




vr. 


Db+ 

n+1] 


TTi 


,Db+ 

^+l] 


A, e, H,) I ngy, 
+ e,/;,)|npy+ 


= TT 


Db+ 

[n+1] 


= TT 


Db+ 

[n+1] 


) 

) 


E 


l-A 

n+1+0 


= 7rP'’+ 


‘[n+1] 


[n+1] 


where the last equality is due to (7) and (8). By dehning the random variable <h = n 
we can write the LR as 


l-A 
n+1 + 0 


(13) 


LR = 


n 

IE(<h I n[„+i] = TT[„+!]) 


5.1. True LR. In order to evaluate the performance of this method one would like to 
compare the LR values obtained with (13) with the ‘true’ ones, meaning the LR values 
obtained when vector p is known, which means that we have the list of the frequencies of 
all the DNA types in the population of interest. The LR in this case can be obtained in 
the following way: 


(14) 

(15) 

(16) 

(17) 

(18) 




LR = 


‘[n+2] I'‘[n+1] 


[n+2] I [n+1] ’ 


1 


Pl{Xn+2 = Xn+llTrf^AlvHdtP) 


1 


Pr(A„+2 = Xn+l\xi, ...,Xn+l,TTfAAl]’^d,P)p{xi, ..., X^+i , p) 

{xi,...,Xn + l) 

1 


Px„+iP(xi,...,Xn+l|7r°+’^pp) 

(xi,) 

1 


E(Px„+iK°5_+,p 
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Notice that in the rare type case Xn+i is observed only once among the xi, Xn+i- 
Hence we call it a singleton. Let Ni denote the number of singletons, and S the set of all 
singletons. Given p and it holds that the distribution of is the same as the 

distribution of all other singletons. This implies that: 


Ivr°5_+ , p) = E( ^ 7r°5_+ , p). 

Xi&S 

Let us denote as X^, .., X’^ the K different values taken by Xi, ..., Xn+i-, ordered 
according to the frequency of their values. Stated otherwise, if n* is the frequency of x* 
among xi,...,Xn+i, then rii > n 2 > ... > uk- Moreover, in case X* and X* have the 
same frequency (n* = rij), than they are ordered according to their values. For instance, 
if Xi = 4, X 2 = 2, X^ = 4, X 4 = 2, X 5 = 3, Xg = 3, X 7 = 10, Xg = 13, Xg = 5, Xio = 2, 
Xii = 9, then Xf = 2, X| = 3, X| = 4, X| = 5, X 5 * = 9, X| = 10, X^ = 13. 

By definition, it holds that 

Xi&S j-nj=l 


Notice that (ni,n 2 , ...,nK) is a partition of n + 1, which will be denoted as In 

the example, = (3, 2, 2,1,1,1,1). A more compact representation for can be 

obtained by using two vectors a and r where aj are the distinct numbers occurring in the 
partition, ordered, and each rj is the number of repetitions of aj. J is the length of these 
two vectors, and it holds that n + 1 = example above we have that 

= (a, r) with a = (1, 2, 3) and r = (4, 2,1). 

Since the distribution of px* only depends on , the latter can replace . 

j:nj = l 

Thus, it holds that 


(19) 


LR = 


Xi 


H Px*\^n+t’P) 

j:nj=l 


Notice that the knowledge of p, is not enough to observe XJ‘,...,X^: p is sorted in 
decreasing order, and even if we know the different values pi, we don’t know to which 
category each value belongs. There is a function, y, treated here as latent variable, which 
assigns all DNA types, ordered according to their frequency in Nature, to one of the 
number {1, 2,..., J} corresponding to the position in a of its frequency in the sample, or 
to 0 if the type if not observed. Stated otherwise. 


Y:{1,2,...}^{1,2,...,J}. 



Given tt 


if the ith most common species in Nature is not observed in the sample, 

if the ith most common species in Nature is one of the rj observed aj times in the sample. 

= (a, r), X must satisfy the following conditions: 


(20) = Vi- 

i=l 

The map x can be represented by a vector x = {Xi-,X 2 -, ■■■) made of its values: Xi = a( 0- 
In the example above we have that x = (0, 3, 2, 2,1,0,0,0,1,1,0, 0,1, 0...0). 
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Notice that, given = (a, r), the knowledge of x implies the knowledge of X^, 

indeed it is enough to sort the positive values among the Xi and take their positions 
in X solving ties by considering the position themselves (if Xi = Xji than the order is 
given by i and j). For instance, in the example, if we sort the values of x and we collect 
their positions we get (2,3,4,5,9,10,13): the reader can notice that we got back to the 

X * V"* 

1 , Ay . 

This means that to obtain the distribution of p, which appears in (19), 

it is enough to obtain the distribution of x|7r^^]'",p. Actually, we are only interested 
in the mean of the sum of singletons in samples of size n + 1 from the distribution of 
Xj^,..., p: this means that we can just simulate samples from the distribution of 

x|7r„+i, p and sum those pa such that Xa = 1- 

To simulate samples from the distribution of x\'^n+i^'P we use a Metropolis - Hasting 
algorithm, on the space of the vectors satisfying condition (20). Notice that for the model 
we assumed p to be infinitely long, but for simulations we will use a finite p, of length M. 
This is equivalent to assume that only M elements in the infinite p are positive, and the 
remaining inhnite tail is made of zeros. Then the state space of the Metropolis Hasting 
Markov chain is made of all vectors of length M whose elements belong to {0,1,...,J}, 
and satisfy the condition (20). If we start with a initial point Xo which satisfies (20) and, 
at each allowed move of the Metropolish Hasthing , we swap two different values Xa and 
Xh inside the vector, condition (20) remains satisfied. The algorithm is based on a similar 
one proposed in Anevski et al. (2013). 

This method allows us to obtain the ‘true’ LR when the vector p is known. This is rarely 
the case, but we can put ourselves in a fictitious world where we know p, and compare the 
true values for the LR with the one obtained by applying our model when p is unknown. 
This will be done in the forthcoming section. 

6. Analysis on a real database. In this section we present the study we made on 
a database of 18,925 Y-STR 23-loci profiles from 129 different locations in 51 countries 
in Europe (Purps et ah, 2014) Different analyses are performed by considering only 7 
Y-STR loci (DYS19, DYS389 I, DYS389 H, DYS3904, DYS3915, DY3926,DY3937) but 
similar results have been observed with the use of 10 loci. 

First the maximum likelihood estimators omle and 0 mle using the entire database are 
obtained. Their values are omle = 0.5 and 0 mle = 216. 

In order to check if the choice of the two parameter Poisson Dirichlet prior is a sensible 
one we first compare the ranked frequencies from the database with the relative frequencies 
of several samples of size n obtained from realisations of ^^{oiMLE-,(^MLE)- The asymp¬ 
totic behaviour described in (11) is also checked. Lastly, we will analyse the loglikelihood 
function for data in order to study the denominator of the LR. 

6.1. Model fitting. In Figure 4, the ranked frequencies from the database are com¬ 
pared to the relative frequencies of samples of size n obtained from several realizations of 
PD{aMLE, Omle)- To do so we run several times the Chinese Restaurant seating plan (up 
to n = 18,925 customers): each run is equivalent to generate a new realization p from the 
PD{aMLE, Omle)- The partition of the customers into tables is the same as the partition 
obtained from an i.i.d. sample of size n from p. The ranked relative sizes of each table 
(thin lines) are compared to the ranked frequencies of our database (thick line). 

^The database has previously been cleaned by Mikkel Meyer Andersen (http://people.math.aau.dk/ 
-mikl/?p=y23). 
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Fig 4. Log scale ranked frequencies from the database (thick line) are compared to the relative frequencies 
of samples of size n obtained from realization of PD(aMLE,dMLE) (thin lines). Asymptotic power law 
behavior is also displayed (dotted lines). 


6.2. Asymptotic power law behavior. The asymptotic behavior described in (11) is also 
analyzed in Figure 4. This behavior is expected in the tail (the limit is over i) and clearly 
the number of customers (n = 18, 925) is not big enough for the small Pi to follow the 
power law. 


6.3. Loglikelihood. It is also interesting to investigate the shape of the loglikelihood 
function for a and 9 given It is defined as 

Db++| 


ln+i{a,e) := logp(7r[^^Y|a,6») 


In Figure 5 (a), the loglikelihood function is compared to the Gaussian distribution 
centered in the maximum likelihood estimates for a and 9, with the observed Fisher 
information as covariance matrix. In Figure 5 (b) the loglikelihood reparametrized using 
1 — o: 

(p = n -and 9 instead of a and 9, is displayed and compared to the corresponding 

n + 1 + 9 

Gaussian distribution. 

The posterior distribution for (4>,0) given is proportional to the loglikelihood 

l[n+i]{4>, times the prior p{(j), 9). The Gaussian behavior of l[n+i]{P, 9) is particularly in¬ 
teresting since it allows to conclude that if the prior p(</>, 9) is smooth around {Pmle-, 9mle), 

.Db-b-bN i -- l-aMLE 


we can approximate E(<h|nf fii ) with Pmle = n 

' n + 1 + 9mle 

mate the LR itself in the following way: 


. Hence, one can approxi- 


( 21 ) 


LR: 


^ + 1 + 9mle 

1 — O-MLE 


Notice that this is equivalent to an hybrid approach, in which the parameters are esti¬ 
mated through the MLE (frequentist) and their value plugged into the Bayesian LR. 
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(a) Relative loglikelihood for parameters a and (b) Relative loglikelihood for (f) = and 

8, compared to a Gaussian distribution, 95% and 9 compared to a Gaussian distribution 95% and 
99% confidence intervals (green and red) 99% confidence intervals (green and red). 

Fig 5. 


6.4. Analyzing the error. As explained in Section 5.1, a Metropolis Hasting algorithm, 
based on Anevski et al. (2013), can be used to obtain the ‘true LR’, that is the LR when 
the vector p is known, as defined in (14) - (18). The latter will be denoted as LR|p. This 
can be compared to the LR obtained with the method described in this paper, when p is 
unknown, as defined in (5) - (13). Notice that errors appear at different levels (Cereda, 
20156): error due to limitedness of samples, error in the model, error in the choice of the 
parameters of the model. The following three tests will explore these levels. 

Test 1. Instead of using the big database of Purps et al. (2014) we consider its restric¬ 
tion to the Dutch population (of size 2085): we pretend this to be the entire population of 
possible perpetrators, and compare the distribution of log 2 o(LR|p) and log^oLR obtained 
by 100 samples of size 100 from this population. 

Test 2. In order to avoid error due to model selection, we use as entire population a 
sample from a realization from PD(q:, 0) distribution. This is done by running the two 
parameter Chinese restaurant process up to 2085 customers. Then, again, 100 samples of 
size 100 from this population are sampled and logio(LR|p) and log^g LR are compared. 
This procedure is repeated 5 times to use 5 different Poisson Dirichlet populations. 

Test 3. The same as in Test 2, but in order to avoid also error due to MLE parameter 
estimations, we use as parameters a and 6 for log^Q LR exactly those which have been used 
to generate the Chinese restaurant process. 

7. Future research questions. 

Because of the mutual absolute continuity results we know that 9 cannot be consistently 
estimated. However, there exists at least one consistent estimator for a (Carlton, 1999), 
namely: 

log Kn 

a = - -. 

logn 
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(a) Comparison 


(b) Error 



Fig 6. For Test 1, Test 2 and Test 3 we plot: (a) comparison between the distribution o/logjQ LR when the 
population proportions p are known (LR\p), and when they are not (LR). (b) the error logj^Q Z/_R|p— logjQ LR. 
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Moreover, a can be estimated consistently also from the power law of (11). We are inter¬ 
ested in the consistency of aMLE-, at least when Q is known, although literature (Carlton, 
1999) is quite skeptical about its performance. The observed Fisher information for a 
grows with n and this gives some hope for the consistency of umle- 

The Gaussian behavior of Figure 5 was unexpected. At least, we expect that increasing 
n, a and 9 would become independent, thus the ellipses will rotate. 
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